Improving long-term multivariate time series forecasting with a seasonal-trend decomposition-based 2-dimensional temporal convolution dense network

Improving the accuracy of long-term multivariate time series forecasting is important for practical applications. Various Transformer-based solutions emerging for time series forecasting. Recently, some studies have verified that the most Transformer-based methods are outperformed by simple linear models in long-term multivariate time series forecasting. However, these methods have some limitations in exploring complex interdependencies among various subseries in multivariate time series. They also fall short in leveraging the temporal features of the data sequences effectively, such as seasonality and trends. In this study, we propose a novel seasonal-trend decomposition-based 2-dimensional temporal convolution dense network (STL-2DTCDN) to deal with these issues. We incorporate the seasonal-trend decomposition based on loess (STL) to explore the trend and seasonal features of the original data. Particularly, a 2-dimensional temporal convolution dense network (2DTCDN) is designed to capture complex interdependencies among various time series in multivariate time series. To evaluate our approach, we conduct experiments on six datasets. The results demonstrate that STL-2DTCDN outperforms existing methods in long-term multivariate time series forecasting.


Methods for time series forecasting
Given the time series prediction tasks hold paramount significance in real world applications, numerous methods have been meticulously developed.Many traditional time series prediction models begin with statistical methods 33,34 .ARIMA 7 adopts the Markov process to constructs the autoregressive model for iterative sequential forecasting.Nevertheless, an autoregressive process is incapable when handling complex sequence with nonlinearity and non-stationarity.As the evolution of neural networks over the past decades, RNN 9 has been specially designed for processing sequential data.To address the challenge of gradient vanishing, many studies propose various modifications of RNN such as LSTM 12 and GRU 13 .TCN 14 is a neural network that employs dilated causal 1D convolution layers tailored for 1D data.However, as the convolution kernel with limited receptive field, the vanilla TCN is unable to explore interdependencies among various time series in multivariate time series data.
As a single model may fall short in learning complex features, some works pay attention to integrate various methods to one framework and present many ensemble models for time series forecasting.Ensemble models have been used in many practical applications successfully, such as traffic forecasting 17 , financial forecasting 18 , and energy management 35,36 .Moreover, ensemble learning methods hold robustness and reliability.Therefore, ensemble models have advantages in the field of medical applications 37,38 .
With the Transformer's impact on natural language processing and computer vision in recent years, there has been a surge in discussions, adaptations, and applications of Transformer-based solutions in time series forecasting.As for network modifications, various adaptations of Transformer for time series can be summarized into two levels: architectural and modular 39 .Some approaches, including Informer 24 , Autoformer 25 , and FEDformer 26 , modify the vanilla positional encoding of Transformer to leverage timestamps of time series and redesign the attention calculation methods to reduce complexity.Besides adapting individual modules within Transformer for time series modeling, approaches like Informer 24 and Pyraformer 40 aim to reconfigure Transformer at the architectural level.Notably, recent studies by Zeng et al. 30 and Das et al. 31 have demonstrated that linear models possess a strong capability for temporal relation extraction.In many cases, these linear models outperform most Transformer-based models in the area of time series forecasting.

Decomposition of time series
Entangled temporal features in multivariate time series forecasting present significant challenges when it comes to effectively exploring local and long-range dependencies among time points and variables 41 .Many methods identify temporal dependencies with entangled temporal patterns, but they often can hardly fully leverage the inherent complex features of time series data, such as seasonality and trends.Therefore, various studies adopt time series decomposition to analyze time series.These decomposition methods can be divided into three categories: frequency domain decomposition, time domain decomposition, and time-frequency domain decomposition.The Fourier transform (FT) 25,26 is a widely-recognized frequency domain decomposition technique in time series analysis.FT and its modifications can transform an original sequence from time domain to frequency domain, but they ignore trends shifts of time series.The STL is an important time domain decomposition method, which can effectively decompose a time series data into three distinct subseries.These three components represent different underlying categories of patterns that exhibit higher predictability.Wavelet transform (WT) 42,43 and empirical wavelet transformation (EWT) 44,45 are time-frequency domain decomposition methods that are particularly well-suited for the analysis of non-stationary series, as they can provide enhanced local time-frequency information.

Methodology
We begin by introducing the formulation representation of multivariate time series forecasting.Subsequently, all involved components and the architecture of the proposed STL-2DTCDN model are presented.Finally, we detail the objective function and the evaluation metric employed for model training.

Problem statement
The formulation is articulated as follows: Given one historical time data denoted as Y , where c (c > 1) denotes the number of variates, y t i is the predicted result of the i th variate at the time step t , and H (H > 1) denotes the number of forecasting time steps.The ground truth for the time period from L + 1 to L + H is denoted as Y L+1:L+H = y t 1 , y t 2 , ..., y t c L+H t=L+1 . Long-term multivariate time series forecasting aims to forecast Y with a larger value of H ( H ≫ 1).Multi-step forecasting can be categorized into two types: iterated multi-step (IMS) 46 forecasting and direct multi-step (DMS) 47 forecasting.IMS forecasting iteratively predicts each time step, but it suffers from error accumulation effects.Compared to IMS forecasting, DMS forecasting can directly learn all prediction results at once.Consequently, DMS forecasting can outperform IMS forecasting in long-term forecasting tasks.

Seasonal-trend decomposition based on loess (STL)
STL is an effective approach that can decompose an original time sequence data into three different subseries, which can be formulated as: where t = 1, 2, ..., n represents time steps, Y t denotes the original time series data, T t , S t , and R t represents the trend, seasonal, and residual components, respectively.In contrast to traditional decomposition methods, STL provides much more robust components for effectively decomposing time series sequence, especially in the presence of outliers.STL methodology consists of two iterative processes, known as the inner loop and the outer loop.Seasonal smoothing and trend smoothing during a single iteration are conducted in the inner loop, updating both the seasonal and trend components.Suppose T t k and S t k are the trend and seasonal components at the end of the k th iteration of the inner loop, respectively.Steps of computing T t k+1 and S t k+1 for the (k + 1) th inner loop are detailed as follows: Step After finishing the inner loop, the initial sequence is decomposed into the trend elements and the seasonal elements, the residual elements are calculated by in the outer loop: t .The parameters of STL were explored in previous experiments 41 .In this study, we set relevant parameters refer to the recommended defaults.Figure 1 presents the decomposition results of STL with the default parameter values, using data from the Centers for Disease Control and Prevention of the United States. (1)

2-Dimensional temporal convolution dense network (2DTCDN)
TCN is an effective approach proposed for modeling long sequence.Different from traditional RNN, TCN leverage the concept of CNN to explore complex dependencies in time sequences.The TCN architecture is presented in the Fig. 2, which consists of various layers, and an optional 1 × 1 convolution.Notably, dilated causal convolutions are used in TCN to increase the receptive field, enabling the capture of features at different time scales in time series data.For one 1-D sequence X ∈ R M , and the filter K d with dilation rate d , the operation of the dilated causal convolution is defined as where X(t) is the t th element of the output processed by the dilated causal convolution, X(t − (d • τ )) represents the (t − (d • τ )) th element of the input sequence X , K d (τ ) denotes the τ th element of the filter, l is the length of the filter.
As illustrated in the Fig. 3, one framework of the dilated causal convolution with a filter size of l = 3 and dilation rate is set to d = 1, 2, 4.However, it's important to note that TCN's filter is one-dimensional (1D) and can only convolve along the time dimension of the time series.Consequently, TCN has limitations in capturing interdependencies among various time series in multivariate time series data.To better adapt TCN for multivariate time series forecasting tasks, we make some adjustments to the vanilla TCN and introduce the 2DTCDN.Figure 4 shows the architecture of 2DTCDN.
(2)  Causal convolution is an important concept, which limits that the output at time t is influenced by elements no later than t, ensuring that the future cannot influence the past.This concept is called information leakage, which is crucial in time series forecasting and it was initially proposed more than 30 years ago by Waibel et al. 48.To maintain consistent dimensionality with the input layer and enable convolutions, zero padding is applied in the hidden layers.However, since we are using 2D convolutional kernels in the proposed 2DTCDN, the padding method and convolution process differ from that of the 1D dilated causal convolution.For one 2D sequence X ∈ R M×N , and the 2D filter K d with dilation rate d, the operation of the 2D dilated causal convolution is for- mulated as  where X i, j is the (i, j) th element of the output processed by 2D dilated causal convolution, element of the 2D filter, h and w respectively denote the height and width of the 2D filter.
Figure 5 presents the padding of 2DTCDN, the kernel size is set to 3 × 3, dilation is set to 1.The padding of in the time dimension is similar to the 1D causal convolution, with the padding length calculated as filter_size−1 × dilation .In the feature dimension, the first 'padding length' features are duplicated and placed after the last feature to serve as padding data.
Dilated convolution is a variant of traditional convolutional operations used in deep learning 49 .In a standard convolution, a filter slides over the input data with a fixed stride, and each weight in the filter interacts with a neighboring input pixel.Different from the standard convolution, dilated convolution introduces gaps between the weights of the filter, allowing it to capture information from a broader receptive field while retaining the original resolution.For example, Fig. 6 illustrates a 2D dilated causal convolution process, the filter size is set to 3 × 3, dilation and stride are both set to 1.

Residual connections
Residual connections are a fundamental architectural component in deep neural networks proposed by He et al. 50.These connections are employed to mitigate the vanishing gradient.The main idea behind residual connections is to add the skip connection between different layers, creating a shortcut path for the gradient during backpropagation.We adopt the dense layer as the residual connection in the proposed 2DTCDN. (3)

Dense layer
Recent studies by Zeng et al. 30 and Das et al. 31 have demonstrated the remarkable capabilities of simple linear models in time series prediction tasks.Essentially, only one simple one-layer linear model can explore complex interdependencies within the sequence data effectively, allowing the neural network to explore intricate features that are important for the task.In our proposed 2DTCDN, we integrate the residual block of TiDE with 2D dilated causal convolution.

Overview of the STL-2DTCDN framework
Figure 7 illustrates the entire framework of the STL-2DTCDN.We use F t to denote the time features at time step t.These time features include the holidays, the day of week, or other specific to a particular time step.The time series are first decomposed into three sub-series: trend ( T t ), seasonal ( S t ), and residual ( R t ) using STL.Subsequently, these three sub-series, concatenated with time features, are separately processed by an encoder architecture with the 2DTCDN block.Next, the processed sub-series are concatenated to form the input data of the next decoder architecture.Finally, outcomes of the decoder layer concatenated with time features are delivered to a dense layer to generate the predictions.

Objective function
The squared error is a loss function frequently employed in time series prediction tasks.It evaluates the squared differences between the real values and predicted values.The optimization objective is formulated as: where Train denotes a set of training time steps, t represents the time step, H represents the horizon of predic- tion, c is the number of subseries, F is the Frobenius norm.

Evaluation metrics
Mean absolute error (MAE) and mean square error (MSE) are adopted as the evaluation metric to assess the performance models.they are defined as:

Experiments Datasets
The proposed STL-2DTCDN is tested on six datasets.All datasets are split into three segments in a chronological order: training, validation, and test sets, with a split ratio of 7:1:2 for Traffic and Electricity.ETT dataset are split with the ratio of 6:2:2, as recommended by Informer 24 and Autoformer 25 .Table 1 presents statistical information of the datasets.
• Traffic 30 collects data from the California.

Methods for comparison
At present, deep learning-based methods are the predominant approach in time series forecasting.We select six baseline methods for comparison with the STL-2DTCDN.These selected baseline methods including three categories: the TCN 14 model, the Transformer-based methods (Informer 24 , Autoformer 25 , PatchTST/64 27 , and FEDformer 26 ), and the linear models (DLinear 30 and TiDE 31 ).TCN is designed for processing sequence data, and the adoption of causal convolution enhances its capacity in exploring dependencies of long-term.Transformerbased methods have made great success in time series forecasting tasks recently.In addition, linear models have been demonstrated that they can achieved promising results in various forecasting tasks.The TiDE conducts experiments with the fixed look-back window 720 for all prediction lengths.Other compared models set the look-back windows as recommended.The results of baseline methods are reported from TiDE 31 and PatchTST 27 .

Experimental settings
The proposed STL-2DTCDN is trained using the L2 loss function and optimized with the ADAM 51 , initialized with a learning rate of 10 -4 .The look-back window for prediction lengths {96, 192, 336, 720} is all set to 720, following the TiDE.The batch size is set to 32 for training and experiments are repeated five times.Experiments are conducted using two NVIDIA GeForce RTX 2080 Ti GPUs, with the implementation in PyTorch.Table 2 presents the range of involved hyper-parameters.We tune these hyper-parameters by leveraging the rolling validation error on the validation dataset.
Since the size of the look-back window is significantly different from the number of subseries for all datasets, the two dimensions of the convolution kernel are designed separately for the time and feature dimension.Specifically, considering the varying number of subseries in different datasets, such as ECL and Traffic with a larger number of time series, we utilize larger parameters [4, 8, 12, 16] in the feature dimension, while for datasets

Experimental results
MAE and MSE of the proposed STL-2DTCDN and compared methods on six practical datasets are shown in Table 4.Each row in the table corresponds to a comparison of results within a specific window horizon, and each column represents the results of a particular model in all cases.The values that highlighted in bold are best results.Since all models are trained with the squared error, let's concentrate on the column of MSE column for comparisons.From Table 4, we can observe that: (1) The STL-2DTCDN can achieve the best results in most cases (as indicated by the count of the best results in the last row).( 2) The STL-2DTCDN shows better performance than TiDE, and the MSE decreases by 3.2% (at 96), 5.7% (at 192), 5.8% (at 336), 6.9% (at 720) in average.The longer the prediction horizon, the better STL-2DTCDN performs.This suggests that STL-2DTCDN is more suitable for long-term forecasting.(3) Our proposed model achieves significantly better results than TiDE in large datasets for long-term forecasting.The MSE decreases by 10.1% (at 720) and 13.8% (at 720) for the Traffic dataset and the Electricity dataset, respectively.However, for four ETT datasets with the prediction length of 720, the STL-2DTCDN achieves only a 4.4% decrease in MSE on average compared to TiDE.We believe that this is because the Traffic and Electricity datasets have a significantly larger number of time series compared to four ETT datasets.Consequently, the 2DTCDN can utilize a larger kernel size in the feature dimension to better explore interdependencies among different time series.Figures 8 and 9 present the comparison of forecasting results calculated by STL-2DTCDN and TiDE with the ground truth for the Traffic and Electricity datasets.From these figures, we observe that the STL-2DTCDN performs better in capturing the temporal repeating patterns and fitting the trend of the curve.This indicates that the STL-2DTCDN effectively captures the temporal characteristics of the data sequences, including seasonality and trends.

Ablation studies
The contribution of each involved component of STL-2DTCDN is figured out by ablation studies.Specifically, each component is removed in turn from the STL-2DTCDN, and we evaluate the performance of each subframework consists of the remaining components.Each sub-framework is detailed as follows: • Re/STL: Remove the STL from the originally proposed STL-2DTCDN.
• Re/Time features: Remove the Time features from the originally proposed STL-2DTCDN.
• STL-2DTCDN → STL-TCN: 2DTCDN is replaced with a vanilla TCN.Table 5 presents the performance of the original STL-2DTCDN and sub-frameworks by removing each component.It can be observed from the Table 5 that the combination of STL, 2DTCDN, and Time features delivers the most precise forecasts in different datasets, and the removal of any single component leads to a decline in performance.Furthermore, we also substitute the 2DTCDN with a standard TCN, and the forecasting results demonstrate that 2DTCDN can achieve better performance in long-term multivariate prediction tasks.

Conclusions
We design a STL-2DTCDN model for long-term multivariate time series forecasting in this paper.STL-2DTCDN utilizes STL to decompose the original time sequence into three subseries.Time features is used to add additional covariates to the model.Furthermore, we adapt the vanilla TCN and introduce the 2DTCDN for long-term multivariate time series forecasting.Compared to various Transformer-based methods and linear models, the STL-2DTCDN exhibits strong capabilities in capturing various temporal patterns and exploring complex interdependencies between different related subseries for long-term multivariate time series forecasting.In the next stage, we will concentrate on: (1) interpreting the model outputs and understanding how a deep neural network achieves its forecasting results; (2) exploring alternative approaches to enhance the capability of capturing temporal patterns and exploring complex interdependencies inhered in multivariate time series.

Figure 1 .
Figure 1.The decomposition results of STL.(a) Original data.(b) Decomposition of Original Data.
Deseasonalizing.Subtract the seasonal elements from the original sequence Y t to get the deseasonal- ized time series Y detrend t = Y t − S k+1 t ; Step 6: Trend smoothing.The trend component T t k+1 is obtained by smoothing Y detrend t the with a Loess smoother.

2 Table 1 .
Statistical information of six datasets.

Table 3 .
Selected hyper-parameters for six datasets.

Table 4 .
Results of long-term multivariate time series forecasting on six datasets.